Replication file for "Known Unknowns: Media Bias in the Reporting of Political Violence," by Nick Dietrich and Kristine Eck.

Questions about these replication materials can be directed to Nick Dietrich at dietrich.nicholas@gmail.com.

Analyses were conducted in R version 3.6.1.

-------------
--CONTENTS---
-------------

Main folder

ged_replication.csv
-The dataset used in the analysis. Based on the UCDP Georeferenced Event Dataset version 17.1. Documentation describing how UCDP collects data is available at https://ucdp.uu.se/downloads/.

variable_importance.csv
-Variable importance scores from 200 random forests. Used to generate Figure 1.

partial_dependence.csv
-Partial dependence for each variable from 200 random forests. Used to generate Figures 2, 4, and 5.

known_unknowns_replication.R
-R script that generates the tables and figures in the text and the appendix. 

-------------------
--RANDOM FORESTS---
-------------------

random_forests subfolder

This folder contains the code needed to generate variable_importance.csv and partial_dependence.csv. Run these scripts only if you need to recreate those two files from scratch. They take a VERY long time to run and require access to a high-performance computing (HPC) system. Note that the syntax for submitting jobs may differ depending on your institution's system.

ucdp_cforest.R
-An R script for running the random forests and saving the results. Each run of this script fits 20 random forests and saves the resulting variable importance and partial dependence scores. The script assumes that your system has 10 cores available and runs the random forests in parallel to reduce processing time. To exactly replicate the results in the text, the script must be run 10 times, each with a different random seed passed from ucdp_cforest.sh, for a total of 200 random forests (10 runs x 20 forests). ged_replication.csv must be in the directory where you run the script.
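
The two requirements above (10 available cores, and ged_replication.csv in the working directory) can be checked before a long run is started. The following is only a sketch and is not part of the replication materials: the function name preflight and the variable REQUIRED_CORES are hypothetical, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on most Linux and macOS systems). On a cluster, the check is only meaningful when run on the compute node itself.

```shell
#!/bin/sh
# Sketch of a pre-flight check before running ucdp_cforest.R.
# "preflight" and REQUIRED_CORES are illustrative names, not part of the archive.
REQUIRED_CORES=10   # the script is described as assuming 10 cores

preflight() {
    dir="${1:-.}"   # directory to check; defaults to the current directory

    # The dataset must sit in the directory where the R script is run.
    if [ ! -f "$dir/ged_replication.csv" ]; then
        echo "missing ged_replication.csv in $dir"
        return 1
    fi

    # Count online cores; fall back to 0 if getconf is unavailable.
    cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 0)"
    if [ "$cores" -lt "$REQUIRED_CORES" ]; then
        echo "only $cores cores visible; the script assumes $REQUIRED_CORES"
        return 2
    fi

    echo "ok"
}
```

A possible use is to call preflight . at the top of the batch script and stop the job if it returns a non-zero status.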

ucdp_cforest.pbs
-A batch script that directs the high-performance computing system to run ucdp_cforest.R. This file specifies the number of nodes/cores, the amount of memory, the user, and the maximum allowed runtime.

ucdp_cforest.sh
-A shell script that submits ucdp_cforest.pbs 10 times, each time passing a different random seed. Running this file once will produce the 200 random forests in the text (10 jobs of 20 forests each).
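
For readers adapting the submission step to their own scheduler, the general pattern is a loop that submits one job per seed. The following is a sketch under stated assumptions, not a copy of ucdp_cforest.sh: the seed values are placeholders (they are NOT the seeds used for the published results), the variable name SEED is hypothetical, and passing it with qsub -v assumes a PBS/Torque-style scheduler. With this approach the R script would also have to read the seed from the job environment, for example with Sys.getenv. The sketch defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: submit ucdp_cforest.pbs ten times, one seed per job.
# SEEDS are placeholders, NOT the seeds behind the published results.
SEEDS="1 2 3 4 5 6 7 8 9 10"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually submit

submit_seeds() {
    for seed in $SEEDS; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show what would be submitted without contacting the scheduler.
            echo "qsub -v SEED=$seed ucdp_cforest.pbs"
        else
            # -v exports SEED into the job's environment (PBS/Torque syntax).
            qsub -v "SEED=$seed" ucdp_cforest.pbs
        fi
    done
}

submit_seeds
```

Setting DRY_RUN=0 in the environment submits the jobs for real. Other schedulers use different commands and flags, so the qsub line is the part most likely to need changing.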

cforest_combine.R
-An R script that combines the output from each random forest into two files: variable_importance.csv and partial_dependence.csv. Run this only AFTER all 10 runs of ucdp_cforest.R have finished.

cforest_combine.pbs
-A batch script that directs the high-performance computing system to run cforest_combine.R. This file specifies the number of nodes/cores, the amount of memory, the user, and the maximum allowed runtime.
